NOTE: This analysis requires at least 10Gb of RAM to run. It uses large files not included in the repository and many steps can take a few minutes to run.
input_folder <- "raw_input" # Where all the large input files are. Ignored by git.
output_folder <- "results" # Where plots will be saved
output_format <- "pdf" # The file format of saved plots
pub_fig_folder <- "publication"
revision_n <- 1
result_path <- function(name) {
file.path(output_folder, paste0(name, ".", output_format))
}
save_publication_fig <- function(name, figure_number) {
file.path(result_path(name), paste0("revision_", revision_n), paste0("figure_", figure_number, "--", name, ".", output_format))
}
matrix_plot_depth <- 6 # The maximum number of ranks to display
color_interval <- c(-4, 4) # The range of values (log 2 ratio of median proportion) to display
min_read_count <- 100 # The minium number of reads needed for a taxon to be graphed
This analysis used OTU abundance data from the Human Microbiome Project (HMP) (Consortium and others 2012), which used 16S metabarcoding to explore the bacterial microbiome of various human body parts. The function parse_hmp_qiime is a wrapper for extract_taxonomy made for this analysis and downloads the abundance data from the location specified below in the code. It might be made into a more generic “table/matrix parser” in the future. The information on sample characteristics (e.g. body site) is saved in a table called mapping in the taxmap object.
library(metacoder)
v35_data <- parse_hmp_qiime(otu_file = "http://downloads.hmpdacc.org/data/HMQCP/otu_table_psn_v35.txt.gz",
mapping_file = "http://downloads.hmpdacc.org/data/HMQCP/v35_map_uniquebyPSN.txt.bz2")
v35_data
## `taxmap` object with data for 985 taxa and 45383 observations:
##
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6, 7 ... 979, 980, 981, 982, 983, 984, 985
##
## ---------------------------- taxon_data ----------------------------
## # A tibble: 985 × 4
## taxon_ids supertaxon_ids rank name
## <chr> <chr> <chr> <chr>
## 1 1 <NA> Root
## 2 2 1 p Acidobacteria
## 3 3 1 p Actinobacteria
## 4 4 1 p Armatimonadetes
## 5 5 1 p Bacteroidetes
## 6 6 1 p CCM11b
## 7 7 1 p Chloroflexi
## # ... with 978 more rows
##
## ----------------------------- obs_data -----------------------------
## # A tibble: 45,383 × 4,790
## obs_taxon_ids otu_id `700114607` `700114380` `700114716` `700114798`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 452 OTU_97.1 0 0 0 0
## 2 734 OTU_97.10 0 0 0 0
## 3 243 OTU_97.100 0 0 0 0
## 4 735 OTU_97.1000 0 0 0 0
## 5 351 OTU_97.10000 0 0 0 0
## 6 179 OTU_97.10001 0 0 0 0
## 7 734 OTU_97.10002 0 0 0 0
## # ... with 4.538e+04 more rows, and 4784 more variables: `700114710` <dbl>,
## # `700114614` <dbl>, `700114755` <dbl>, `700114715` <dbl>, `700114706` <dbl>,
## # `700114613` <dbl>, `700114803` <dbl>, `700114714` <dbl>, `700114482` <dbl>,
## # `700114439` <dbl>, `700114658` <dbl>, `700114387` <dbl>, `700114441` <dbl>,
## # `700114608` <dbl>, `700111759` <dbl>, `700114603` <dbl>, `700106988` <dbl>,
## # `700113609` <dbl>, `700114612` <dbl>, `700110821` <dbl>, `700114486` <dbl>,
## # `700114481` <dbl>, `700114377` <dbl>, `700114424` <dbl>, `700114712` <dbl>,
## # `700114713` <dbl>, `700114554` <dbl>, `700114606` <dbl>, `700114335` <dbl>,
## # `700114707` <dbl>, `700114442` <dbl>, `700114610` <dbl>, `700114437` <dbl>,
## # `700114327` <dbl>, `700114328` <dbl>, `700114271` <dbl>, `700114223` <dbl>,
## # `700114012` <dbl>, `700114117` <dbl>, `700111749` <dbl>, `700113560` <dbl>,
## # `700114709` <dbl>, `700108534` <dbl>, `700113020` <dbl>, `700114005` <dbl>,
## # `700114212` <dbl>, `700114274` <dbl>, `700114162` <dbl>, `700108165` <dbl>,
## # `700113059` <dbl>, `700114330` <dbl>, `700111035` <dbl>, `700110827` <dbl>,
## # `700114051` <dbl>, `700105883` <dbl>, `700114282` <dbl>, `700114279` <dbl>,
## # `700113019` <dbl>, `700114750` <dbl>, `700114386` <dbl>, `700113502` <dbl>,
## # `700109121` <dbl>, `700111515` <dbl>, `700110308` <dbl>, `700114113` <dbl>,
## # `700113060` <dbl>, `700111520` <dbl>, `700114431` <dbl>, `700114006` <dbl>,
## # `700111243` <dbl>, `700114105` <dbl>, `700114056` <dbl>, `700110426` <dbl>,
## # `700106490` <dbl>, `700110433` <dbl>, `700114385` <dbl>, `700114438` <dbl>,
## # `700111521` <dbl>, `700113016` <dbl>, `700111518` <dbl>, `700111306` <dbl>,
## # `700114440` <dbl>, `700113652` <dbl>, `700114048` <dbl>, `700111449` <dbl>,
## # `700110831` <dbl>, `700114112` <dbl>, `700111824` <dbl>, `700113017` <dbl>,
## # `700114333` <dbl>, `700114004` <dbl>, `700110654` <dbl>, `700114611` <dbl>,
## # `700114210` <dbl>, `700114339` <dbl>, `700111305` <dbl>, `700114170` <dbl>,
## # `700111163` <dbl>, `700110657` <dbl>, `700110173` <dbl>, ...
##
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies
The input data included read abundance for each sample-OTU combination, but we need the abundances associated with each taxon for graphing. The obs function is used here to get a list of row indexes corresponding to OTUs for each taxon. There will usually be multiple OTUs assigned to the same taxon, especailly for corase taxonomic ranks (e.g. the root will have all OTU indexes), so the abundances at those indexes are are summed to provide the total abundance for each taxon.
calculate_abundance <- function(data, col = colnames(data$obs_data) %in% data$mapping$sample_id, col_name = "abundance") {
data$taxon_data[[col_name]] <- vapply(obs(data),
function(i) sum(data$obs_data[i, col]), numeric(1))
return(data)
}
calculate_prop <- function(data, col = colnames(data$obs_data) %in% data$mapping$sample_id, col_name = "abundance") {
data$taxon_data[[col_name]] <- vapply(obs(data),
function(i) sum(data$obs_data[i, col]) , numeric(1)) / sum(data$obs_data[, col])
return(data)
}
v35_data <- calculate_abundance(v35_data)
To get an idea of how the data looks overall lets make a plot showing OTU and read abundance of all the data combined. I will exclude any taxon that has less than 100 reads.
plot_all <- function(data, output_name, seed = 1) {
set.seed(seed)
data %>%
filter_taxa(abundance >= min_read_count) %>%
filter_taxa(name != "") %>% # Some taxonomic levels are not named
heat_tree(node_size = n_obs,
node_size_axis_label = "Number of OTUs",
node_color = abundance,
node_color_trans = "area",
node_color_axis_label = "Number of reads",
node_label = name,
node_label_max = 100,
overlap_avoidance = 1,
output_file = result_path(output_name))
}
plot_all(v35_data, "hmp--v35--all_data")
Here we can see that there are realtively few abundant taxa and many rare ones. However, we dont know anything about the species level diversity since the classifications go to genus only.
Stacked bar charts are commonly used to visualize community composition, with different colors corresponding to taxa. However, the use of color for categories limits the number of number of taxa displayed to the number of colors that are easily decernable, which is around 10. Therefore, rare taxa are often exculded from stackaed barcharts or only coarse taxonomic ranks are displayed. This might leave out biolicoally relevant information. The code below visulizes the community composition of the same 2 samples from the HMP using both a stacked barchart and heat trees so the two methods can be compared.
v35_data <- calculate_abundance(v35_data, "700114482", "sample_1")
v35_data <- calculate_abundance(v35_data, "700114439", "sample_2")
plot_sample_tree <- function(data, sample, out_name, min_count = min_read_count, ...) {
data %>%
mutate_taxa(sample = data$taxon_data[[sample]]) %>%
mutate_taxa(sample_prop = sample / max(sample)) %>%
filter_taxa(sample_1 >= min_count | sample_2 >= min_count) %>%
heat_tree(
...,
title_size = .05,
node_size = sample,
node_color = sample,
node_label = ifelse(sample_prop >= 0.01, name, NA),
node_label_max = 50,
node_size_range = c(.01, .07),
node_label_size_range = c(.015, .03),
node_color_axis_label = "Abundance",
output_file = result_path(out_name))
}
sample_1 <- plot_sample_tree(v35_data, "sample_1", "sample_1_tree", title = "Sample 1")
sample_2 <- plot_sample_tree(v35_data, "sample_2", "sample_2_tree", title = "Sample 2")
min_count <- 1
bar_data <- filter_taxa(v35_data, (sample_1 >= min_count | sample_2 >= min_count) & name != "") %>%
calculate_prop("700114482", "sample_1") %>%
calculate_prop("700114439", "sample_2") %>%
taxon_data()
bar_data <- bar_data[bar_data$rank == "p", c("name", "sample_1", "sample_2")]
bar_data <- reshape2::melt(bar_data, id.vars = "name")
bar_data <- bar_data[! bar_data$name %in% c("GN02", "SR1"), ]
library(ggplot2)
bar_plot <- ggplot(bar_data, aes(x = variable, y = value, fill = name)) +
geom_bar(stat = "identity") +
scale_x_discrete(labels = c("1", "2")) +
xlab("Sample") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_blank(),
axis.title.y=element_blank(),
axis.title.x=element_text(size=14),
axis.text.y=element_blank(),
axis.text.x=element_text(size=14, margin=margin(b = 6, t = -15)),
axis.ticks.y=element_blank(),
legend.title=element_blank(),
legend.text=element_text(size=12),
legend.position=c(1,1),
legend.justification=c(0, 1),
# legend.key.width=unit(1.2, "lines"),
plot.margin = unit(c(-0.35, 8, 0, 0.5), "lines"))
ggsave(bar_plot, filename = result_path("bar_chart"), dpi = 350, width = 8, height = 8)
library(gridExtra)
library(cowplot)
plot_name <- "barchart_heat_tree_comparison"
combined_plot <- plot_grid(bar_plot, sample_1, sample_2, ncol = 3, nrow = 1, rel_widths = c(1, 2, 2))
save_plot(filename = result_path(plot_name),
combined_plot, ncol = 3, nrow = 1, base_aspect_ratio = .9)
save_publication_fig(plot_name, figure_number = 2)
## [1] "results/barchart_heat_tree_comparison.pdf/revision_1/figure_2--barchart_heat_tree_comparison.pdf"
print(combined_plot)
Note how the two visulization methods can suggest different conclusions regarding how similar the two communities are. The stacked barcharts suggest that communities are dominated by Firmicutes, but sample 2 has additional Bacteriodetes and Proteobacteria. They look different, but not too different. The heat trees tell a much differnt story. Almost all of sample 1 is composed of two genera, Lactobacillus and Gardnerella, which are completly absent in sample 2. In fact, the samples do not even share any families, so the similarity suggested by the stacked barchart is misleading. We could have construcated the stacked barchart at the family level to see these differneces, but then there would be too many colors to easily distinguish.
To be fair, I did choose these two samples to prove this point. In other cases stacked barcharts might represent data well, but using heat trees instead of stacked barcharts will avoid these kind of misleading graphs.
The HMP dataset is great for comparing treatments since there are many body sites with many replicates so statistical tests can be used to find correlations between body sites and taxon read abundance (the relationship between read abundance and organism abundance is more fuzzy and open to debate). The code below applies the Wilcox rank-sum test to differences in median read proportion between every pair of body sties compared. Since the data is compositional in nature (i.e. not idependent samples) we used a non-parametric test and used median instead of mean read proportion.
calculate_prop <- function(data, site) {
sample_cols <- as.character(unlist(data$mapping[data$mapping$body_site == site, "sample_id"]))
sample_cols <- sample_cols[sample_cols %in% colnames(data$obs_data)]
obs_indexes <- obs(data)
total_counts <- vapply(sample_cols, function(s) sum(data$obs_data[, s]), numeric(1))
col_name <- paste0(site, "_median_prop")
lapply(obs_indexes,
function(i) {
vapply(sample_cols,
function(s) sum(data$obs_data[i, s]) / total_counts[s],
numeric(1))})
}
calculate_prop_diff <- function(data, site_1, site_2) {
remove_inf <- function(values) {
values[values == Inf] <- 10000000000000000000000000
values[values == -Inf] <- -10000000000000000000000000
values[is.nan(values)] <- 0
return(values)
}
props_1 <- calculate_prop(data, site_1)
props_2 <- calculate_prop(data, site_2)
p_value <- mapply(function(x, y) wilcox.test(x, y)$p.value,
props_1, props_2)
p_value <- p.adjust(p_value, method = "fdr", n = length(p_value) * 10) # * 10 HACK!
result <- ifelse(p_value > 0.05 | is.nan(p_value), 0,
log2(vapply(props_1, median, numeric(1)) / vapply(props_2, median, numeric(1))))
remove_inf(result)
}
plot_body_site_diff <- function(data, site_1, site_2, output_name, seed = 1) {
set.seed(seed)
data %>%
mutate_taxa(median_prop_diff = calculate_prop_diff(data, site_1, site_2)) %>%
filter_taxa(abundance >= min_read_count) %>%
filter_taxa(name != "") %>% # Some taxonomic levels are not named
heat_tree(node_size_axis_label = "Number of OTUs",
node_size = n_obs,
node_color_axis_label = "Log 2 ratio of median proportions",
node_color = median_prop_diff,
node_color_range = diverging_palette(),
node_color_trans = "linear",
node_color_interval = color_interval,
edge_color_interval = color_interval,
node_label = name,
node_label_max = 150,
output_file = result_path(paste0(output_name, "--", site_1, "_vs_", site_2)))
}
plot_body_site_diff(v35_data, "Throat", "Saliva", "hmp--v35")
plot_body_site_diff(v35_data, "Supragingival_plaque", "Subgingival_plaque", "hmp--v35")
plot_body_site_diff(v35_data, "Anterior_nares", "Buccal_mucosa", "hmp--v35")
We call these types of graphs differential heat trees and they are great for comparing any type of data associated with two samples or treatments.
The graphs above are great for comparing two treatments, but it would be nice to see how many treatments compare in a single graph. To do this, we have developed a pair-wise matrix layout for comparisons of this kind.
Note: There is a lot of non-trivial code below to do this. We are workign on wrapping all this into a function to make this kind of figure easier to make, but to do this well requires a rebuild of the taxmap object to make it more flexible.
First we need to calculate all of the information that will be plotted.
# Identify pairs of treatments to compare
body_sites <- c("Stool", "Saliva", "Tongue_dorsum", "Buccal_mucosa", "Anterior_nares")
combinations <- t(combn(seq_along(body_sites), 2))
site_diff_cols <- apply(combinations, MARGIN = 1, function(i) paste0(body_sites[i], collapse = "-"))
layout_matrix <- matrix(rep(NA, (length(body_sites))^2), nrow = length(body_sites))
for (index in 1:nrow(combinations)) {
layout_matrix[combinations[index,1], combinations[index,2]] <- index
}
# Make reduced taxonomy for plotting
v35_reduced <- v35_data
# Calculate significant body site differneces
not_used <- apply(combinations, MARGIN = 1, function(i) {
v35_reduced$taxon_data[[paste0(body_sites[i], collapse = "-")]] <<- calculate_prop_diff(v35_reduced, body_sites[i[1]], body_sites[i[2]])
})
# Remove data not needed for plotting
v35_reduced$obs_data <- v35_reduced$obs_data[, c("obs_taxon_ids", "otu_id")] # remove per-otu data since it is big
# Record taxa with no differences in any treatment
v35_reduced <- mutate_taxa(v35_reduced,
is_diff = apply(v35_reduced$taxon_data[ , site_diff_cols], MARGIN = 1,
function(a_row) any(a_row != 0)))
Next, we will make all of the individual differential heat trees and the guide tree and save them in a list.
# Make individual plots
plot_body_site_diff_small <- function(data, diff_col, seed = 1) {
set.seed(seed)
data %>%
mutate_taxa(diff_col = data$taxon_data[[diff_col]]) %>%
filter_taxa(n_supertaxa <= matrix_plot_depth) %>%
filter_taxa(abundance > 1000) %>%
filter_taxa(name != "") %>% # Some taxonomic levels are not named
heat_tree(node_size = n_obs,
node_color = diff_col,
node_color_range = diverging_palette(),
node_color_trans = "linear",
node_color_interval = color_interval,
edge_color_interval = color_interval,
overlap_avoidance = 0.7,
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Median proportion difference",
make_legend = FALSE)
}
sub_plots <- lapply(site_diff_cols, function(x) plot_body_site_diff_small(v35_reduced, x))
key_plot <- v35_reduced %>%
filter_taxa(n_supertaxa <= matrix_plot_depth) %>%
filter_taxa(abundance > 1000) %>%
filter_taxa(name != "") %>% # Some taxonomic levels are not named
heat_tree(node_size = n_obs,
node_color = "#DDDDDD",
node_color_range = diverging_palette(),
node_color_trans = "linear",
node_color_interval = color_interval,
node_label = ifelse(n_obs > 1000, name, NA),
edge_label = ifelse(n_obs > 1000, NA, name),
node_label_max = 500,
edge_label_max = 500,
edge_label_size_range = c(0.007, 0.02),
node_label_size_range = c(0.007, 0.02),
overlap_avoidance = 0.7,
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log 2 ratio of median proportions",
make_legend = TRUE)
Finally we will calcaulte the coordinates for each plot and arrange them together.
calc_subplot_coords <- function(a_matrix, x1 = 0, y1 = 0, x2 = 1, y2 = 1) {
# lowerleft = c(x1, y1), upperright = c(x2, y2)
x_coords <- seq(from = x1, to = x2, length.out = ncol(a_matrix) + 1)[- (ncol(a_matrix) + 1)]
y_coords <- seq(from = y1, to = y2, length.out = nrow(a_matrix) + 1)[- (nrow(a_matrix) + 1)]
do.call(rbind, lapply(1:ncol(a_matrix), function(x) data.frame(plot_index = a_matrix[, x],
x = x_coords[x],
y = rev(y_coords))))
}
# remove empty column/row
layout_matrix <- layout_matrix[! apply(layout_matrix, MARGIN = 1, function(x) all(is.na(x))), ]
layout_matrix <- layout_matrix[ ,! apply(layout_matrix, MARGIN = 2, function(x) all(is.na(x)))]
# Get subplot layout data
matrix_data <- calc_subplot_coords(layout_matrix, x1 = 0.2, y1 = 0.2, x2 = 0.95, y2 = 0.95)
matrix_data$site <- site_diff_cols[layout_matrix]
matrix_data <- tidyr::separate(matrix_data, site, c("site_1", "site_2"), "-")
matrix_data <- matrix_data[!is.na(matrix_data$plot_index), ]
matrix_data <- matrix_data[order(matrix_data$plot_index), ]
rownames(matrix_data) <- matrix_data$plot_index
# Make label data
named_row <- which(apply(layout_matrix, MARGIN = 1, function(x) all(!is.na(x))))
named_col <- which(apply(layout_matrix, MARGIN = 2, function(x) all(!is.na(x))))
horz_label_data <- matrix_data[match(layout_matrix[named_row, ], matrix_data$plot_index), ]
vert_label_data <- matrix_data[match(layout_matrix[, named_col], matrix_data$plot_index), ]
subgraph_width <- abs(horz_label_data$x[1] - horz_label_data$x[2])
subgraph_height <- abs(vert_label_data$y[1] - vert_label_data$y[2])
horz_label_data$label_x <- horz_label_data$x + subgraph_width / 2 # center of label
horz_label_data$label_y <- 0.96 # bottom of label
vert_label_data$label_x <- 0.96 # bottom of rotated label
vert_label_data$label_y <- vert_label_data$y + subgraph_height / 2 # center of rotated label
# Make plot
library(cowplot)
library(metacoder)
label_size <- 12
matrix_plot <- ggdraw() +
draw_plot(key_plot, x = 0, y = -0.03, width = 0.78, height = 0.67) +
draw_text(gsub("_", " ", horz_label_data$site_2),
x = horz_label_data$label_x, y = horz_label_data$label_y,
size = label_size, colour = diverging_palette()[1],
hjust = "center", vjust = "bottom") +
draw_text(gsub("_", " ", vert_label_data$site_1),
x = vert_label_data$label_x, y = vert_label_data$label_y,
size = label_size, colour = diverging_palette()[3],
hjust = "center", vjust = "bottom", angle = -90)
for (i in seq_along(sub_plots)) {
matrix_plot <- matrix_plot + draw_plot(sub_plots[[i]],
x = matrix_data[i, "x"],
y = matrix_data[i, "y"],
width = subgraph_width, height = subgraph_height)
}
print(matrix_plot)
plot_name <- "figure_3--hmp_matrix_plot"
save_plot(result_path(plot_name),
matrix_plot, base_width = 7.5, base_height = 7.5)
save_publication_fig(plot_name, figure_number = 5)
## [1] "results/figure_3--hmp_matrix_plot.pdf/revision_1/figure_5--figure_3--hmp_matrix_plot.pdf"
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] cowplot_0.7.0 gridExtra_2.2.1 ggplot2_2.2.1
## [4] metacoder_0.1.2 knitcitations_1.0.7 knitr_1.14
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9 formatR_1.4 plyr_1.8.4 bitops_1.0-6
## [5] tools_3.3.1 digest_0.6.12 lubridate_1.6.0 evaluate_0.10
## [9] tibble_1.2 gtable_0.2.0 bibtex_0.4.0 igraph_1.0.1
## [13] DBI_0.5-1 yaml_2.1.13 RefManageR_0.13.1 dplyr_0.5.0
## [17] httr_1.2.1 stringr_1.1.0 rprojroot_1.2 R6_2.2.0
## [21] XML_3.98-1.4 rmarkdown_1.3 RJSONIO_1.3-0 tidyr_0.6.1
## [25] reshape2_1.4.2 magrittr_1.5 backports_1.0.5 scales_0.4.1
## [29] htmltools_0.3.5 assertthat_0.1 colorspace_1.2-7 labeling_0.3
## [33] stringi_1.1.2 RCurl_1.95-4.8 lazyeval_0.2.0 munsell_0.4.3
Consortium, Human Microbiome Project, and others. 2012. “Structure, Function and Diversity of the Healthy Human Microbiome.” Nature 486 (7402). Nature Research: 207–14.
Comments